# Load packages
library(tidyverse)
library(survival)
library(survminer)
# Load Data
ninja <- read_csv("anw_2021_stage1.csv")American Ninja Warrior - Kaplan-Meier Survival Analyis
Welcome video
Introduction
In this module, we will explore American Ninja Warrior data from the 2021 stage 1 finals through the lens of survival analysis. Survival analysis is a statistical method used to analyze the time until an event of interest occurs (often this event is death). In this case, the obstacles will be our time, and the event of interest will be DEATH (we’re calling failing off an obstacle death).
American Ninja Warrior Logo
Imagage Source Wikimedia
One of the simplest methods for survival analysis is the Kaplan-Meier method. This method estimates the survival function, which is the probability that a subject survives past a certain time. The Kaplan-Meier method is non-parametric, meaning it makes no assumptions about the distribution of the survival times. It is a step function that decreases at each event time.
NOTE: R is the name of the programming language itself and RStudio is a convenient interface. To throw even more lingo in, you may be accessing RStudio through a web-based version called Posit Cloud. But R is the programming language you are learning :)
Getting started: American Ninja Warrior data
The first step to any analysis in R is to load necessary packages and data.
Running the following code will load the tidyverse, survival, and survminer packages and the anw_2021_stage1 data we will be using in this lab.
TIP: As you follow along in the lab, you should run each corresponding code chunk in your .qmd document. To “Run” a code chunk, you can press the green “Play” button in the top right corner of the code chunk in your .qmd. You can also place your cursor anywhere in the line(s) of code you want to run and press “command + return” (Mac) or “Ctrl + Enter” (Windows).
TIP: If you get an error while attempting to load the library or data, you may need to install the package or check the file path. You can install packages using the install.packages() function in R. Once you have installed a package, you can load it into your R environment using the library() function. If that is not the issue, you may need to check the file path to ensure the data is in the correct location.
We can use the glimpse() function to get a quick look at our ninja data. The glimpse code provides the number of observations (Rows) and the number of variables (Columns) in the dataset. The “Rows” and “Columns” are referred to as the dimensions of the dataset. It also shows us the names of the variables and the first few observations for each variable.
glimpse(ninja)Rows: 69
Columns: 5
$ name <chr> "Meagan Martin", "DeShawn Harris", "Heather Weissinger…
$ sex <chr> "F", "M", "F", "M", "F", "F", "M", "M", "M", "F", "F",…
$ obstacle <chr> "Slide Surfer", "Slide Surfer", "Swinging Blades", "Sw…
$ obstacle_number <dbl> 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, …
$ cause <chr> "Fall", "Fall", "Fall", "Fall", "Fall", "Fall", "Fall"…
Another useful function to get a quick look at the data is the head() function. This function shows the first few rows of the dataset.
head(ninja)# A tibble: 6 × 5
name sex obstacle obstacle_number cause
<chr> <chr> <chr> <dbl> <chr>
1 Meagan Martin F Slide Surfer 1 Fall
2 DeShawn Harris M Slide Surfer 1 Fall
3 Heather Weissinger F Swinging Blades 2 Fall
4 Cam Baumgartner M Swinging Blades 2 Fall
5 Brittney Durant F Swinging Blades 2 Fall
6 Casey Rothschild F Swinging Blades 2 Fall
TIP: Type your answers to each exercise in the .qmd document.
Terms to know
Before proceeding with the survival analysis, let’s make sure we understand American Ninja Warrior and some of it’s vocabulary to help us climb our way through this lab.
How does American Ninja Warrior work?
American Ninja Warrior is an NBC competition show where competitors attempt to complete a series of obstacle courses of increasing difficulty. In a single obstacle course, the competitors must complete a series of obstacles in a row. If they fail an obstacle, they are eliminated from the competition. The competitors also have a time limit to complete the course. The competitors are ranked based on how far they get in the course and how quickly they complete it.
Most of the obstacles are designed to test the competitors’ upper body strength. Some obstacles require balance and agility though.
The warped wall is arguably the most famous, although now least difficult, obstacle on an American Ninja Warrior course. The warped wall is a curved wall that competitors must run up and grab the top of. The warped wall is on every course and is often the final obstacle, although this is not the case on the Finals courses.
The warped wall which was previously 14 feet and is now 14.5 feet tall. They have even had a 18 foot warped wall on the show.
Warped Wall
Image Source: SB Nation
The obstacles in American Ninja Warrior are all given names. For example, the famed Warped Wall is a curved wall that competitors must run up and grab the top of. The Salmon Ladder is a series of rungs that competitors must move up by jumping and pulling themselves up.
Watch Enzo Wilson complete the American Ninja Warrior Course at the 2021 Finals Stage 1 (Season 13) in the video below.
Variable Descriptions
The ninja data you’ll be analyzing in this lab provides the individual run information for each ninja in the 2021 Finals Stage 1 (Season 13). The data includes the ninja’s name, their sex, the obstacle they failed on, and the cause of that failure.
Variable Descriptions
| Variable | Description |
|---|---|
name |
Name of the American Ninja Warrior |
sex |
Sex of the American Ninja Warrior (M/F) |
obstacle |
The name of the obstacle the ninja failed on |
obstacle_number |
The obstacles’ place in the run |
cause |
What caused the ninja to fail (Fall/Time/Complete) |
Points of Confusion
Use the following code to create a table of the obstacle names and their corresponding obstacle numbers. This will help you understand the order of the obstacles in the course.
ninja |>
distinct(obstacle, obstacle_number)# A tibble: 10 × 2
obstacle obstacle_number
<chr> <dbl>
1 Slide Surfer 1
2 Swinging Blades 2
3 Double Dipper 3
4 Jumping Spider 4
5 Tire Run 5
6 Dipping Birds 7
7 The High Road 8
8 Fly Hooks 8
9 Cargo Net 9
10 Complete 10
It can be seen quickly that there is no obstacle number 6 in the data and that there are 2 obstacles with number 8. Both of these appear confusing at first, but they have logical explanations.
The missing obstacle number 6 is due to the fact that no ninja failed on obstacle 6. Some quick research tells us that obstacle 6 was the warped wall.
Read about the 2021 Stage 1 Split Decision here.
The duplicate obstacle number 8 is due to the fact that Stage One of the 2021 Finals allowed a split-decision. This means that competitors could choose between two different obstacles for obstacle 8.
Another important things to note are that obstacle 10 is not really an obstacle, but rather the end of the course.
Lastly, one competitor Joe Moravsky ran the course twice (falling the first time and completing it the second time). This is because he was the Safety Pass Winner from the previous round. The Safety Pass allows a competitor to run the course again if they fail the first time.
Kaplan-Meier Survival Analysis
Censored Data
In survival analysis, we often have need to censor data. Censored data occurs when the event of interest has not occurred for some of the observations. In our case, the event of interest is the failure of a competitor on an obstacle. If a competitor completes the course, we do not know which obstacle they would have failed on. Additionally if the time limit is reached, we do not know which obstacle the competitor would have failed on, although we do know that they didn’t fail on any of the ones they completed before time was up. Both of these situations are examples of censored data.
Use the code below to create a new column called censor in the ninja data that is a binary indicator of whether or not the observation should be censored. This column will be used to indicate whether the data is censored or not.
# Makes a column called censor that is 1 if the competitor failed and 0 if they completed the course or ran out of time
ninja <- ninja |>
mutate(censor = if_else(cause %in% c("Complete", "Time"), 0, 1))Calculating Survival Probabilities
The Kaplan-Meier estimator uses information from all of the observations in the data and considers survival to a certain point in time as a series of steps defined at the observed survival times.
In order to calculate the probability of surviving past a certain point in time (past a certain obstacle in this case), the conditional probability of surviving past that point given that the competitor has survived up to that point must be calculated first.
The formula for conditional probability of surviving past a point in time (\(t_i\)) given that the competitor has survived up to that point in time(\(t_{i-1}\)) is:
Note: This function could also be written as \(P(T > t_i | T \geq t_{i-1}) = \frac{n_i - d_i}{n_i}\)
\(P(T \geq t_i | T \geq t_{i-1}) = 1- \frac{d_i}{n_i}\)
Where:
\(d_i\) is the number of competitors that failed at time \(t_i\)
\(n_i\) is the number of competitors that were at risk at time \(t_i\)
The Kaplan-Meier estimator is the product of the conditional probabilities of surviving past each point in time up through that point in time.
\(\hat{S}(t) = \prod_{t_i \leq t} (1 - \frac{d_i}{n_i})\)
Note: Censored data does not count in the at risk competitors
where \(n_i = n_{i-1} - d_{i-1} - c_{i-1}\)
For example if we have this data:
# Setting a seed for reproducibility
set.seed(123)
# Creating fake data
fake_data <- tibble(obstacle_number = c(1:5, 4:5), censor = c(rep(1, 5), rep(0, 2))) |>
sample_n(25, replace = TRUE)
head(fake_data)# A tibble: 6 × 2
obstacle_number censor
<int> <dbl>
1 5 0
2 5 0
3 3 1
4 4 0
5 3 1
6 2 1
And we wanted to calculate the Kaplan-Meier estimate of surviving past obstacle 2 we would need to find the following probabilities:
\(P(T > 1 | T > 0) = P(T > 1) = 1 - \frac{\text{number of competitors that failed at obstacle 1}}{\text{number of competitors that attempted obstacle 1}}\)
\(P(T > 2 | T > 1) = 1 - \frac{\text{number of competitors that failed at obstacle 2}}{\text{number of competitors that attempted obstacle 2}}\)
The following code calculates the first two probabilities:
fake_data_summary <- fake_data |>
filter(obstacle_number <= 2) |>
group_by(obstacle_number) |>
summarize(fails = sum(censor == 1)) |>
ungroup() |>
mutate(at_risk = 25 - cumsum(fails),
cond_prob = 1 - fails/at_risk)
fake_data_summary# A tibble: 2 × 4
obstacle_number fails at_risk cond_prob
<int> <int> <dbl> <dbl>
1 1 4 21 0.810
2 2 3 18 0.833
We would then need to multiply these two probabilities together to get the Kaplan-Meier estimate of surviving past obstacle 2.
\(P(T > 2) = P(T > 1) * P(T > 2 | T > 1)\)
The following code calculates the Kaplan-Meier estimate of surviving past obstacle 2:
kaplan_meier <- fake_data_summary |>
summarize(kaplan_meier = prod(cond_prob))The Kaplan-Meier estimate of surviving past obstacle 2 in this fake example is 0.6746032.
Kaplan-Meier Estimator Manual Calculation
The ninja data frame contains information about individual competitors in the ninja competition. We will need to summarize the data to calculate the Kaplan-Meier estimator manually.
TIP: You can pipe the data frame into the mutate function to create a new column.
TIP: The mutate function works like so: data_frame |> mutate(new_column = calculation)
Type ?geom_step, ?geom_point, or ?ggplot in the console to learn more about these functions.
TIP: Remember that you can add the + operator to continue adding layers to the plot like seen below
ggplot(your_data, aes(x = time_var, y = kaplan_meier_var)) +
geom_step() +
geom_point()TIP: You can also add labels to the plot using the labs function like seen below
ggplot(your_data, aes(x = time_var, y = kaplan_meier_var)) +
geom_step() +
geom_point() +
labs(title = "Your Title",
x = "X Axis Label",
y = "Y Axis Label")Using R to Calculate the Kaplan-Meier Estimator
Phew! That was a lot of tedious work to calculate and plot the Kaplan-Meier estimator manually. Luckily, there is a much easier way to calculate the Kaplan-Meier estimator using R.
The survival package in R provides a function called survfit that can be used to calculate the Kaplan-Meier estimator. The survfit function requires a Surv object as input. The Surv object is created using the Surv function, which requires two arguments:
The time to event data. The time to event data is the time at which the event occurred or the time at which the individual was censored. In our case this is the obstacle_number in our
ninjadata.The event status. The event status is a binary variable that indicates whether the event occurred or the individual was censored. The event status is coded as 1 if the event occurred and 0 if the individual was censored. This is contained in the
censorcolumn of theninjadata.
Below a survfit model is created for the ninja.
ninja_km <- survfit(Surv(obstacle_number, censor) ~ 1, data = ninja)The computations for calculating the Confidence Interval for the K-M Estimate are fairly complex. The method most commonly used is called the log-log survival function and was proposed by Kalbfleisch and Prentice (2002). This function is computed by \(ln(-ln[\hat{S}(t)])\) with variance derived from the delta method and calculated by \[ \frac{1}{[ln(\hat{S}(t))]^2}\sum_{t_i\leq{t}}\frac{d_i}{n_i(n_i - d_i)} \].
The endpoints for the confidence interval for the log-log survival function are therefore found by \(ln(-ln[\hat{S}(t)]) \pm Z_{1-\alpha / 2} SE [ln(-ln[\hat{S}(t)]) ]\)
And the endpoints expressed by the computer and seen in the summary are \(exp[-exp(\hat{c}_u)] \text{ and } exp[-exp(\hat{c}_l)]\)
Quartile Interpretation
The three quartiles are common statistics to look at when doing a survival analysis. The interpretations of these are as follows:
Note: If the data is uncensored the estimate is just the median of the data. If the data is censored, the KM estimate is used to find these by finding the time at which it drops below the percentile
25th Percentile- 75% of the people survive past this point in time
Median- 50% of the people will survive past this time
75th Percentile- 25% survive past this time
Plotting with R
After fitting a Kaplan-Meier model, we can use the ggsurvplot function from the survminer package to plot the Kaplan-Meier estimator. The ggsurvplot function requires the Kaplan-Meier model as input.
Below is an example of how easy it is to plot the Kaplan-Meier estimator using R.
ggsurvplot(ninja_km,
conf.int = TRUE)Kaplan-Meier Estimator by Groups
Sometimes we may want to compare the survival probabilities of different groups of individuals. For example, we may want to compare the survival probabilities base on age, gender, treatment group, or a variety of other factors.
In our data set, we have a variable called sex that indicates the gender of the competitor. We can use this variable to compare the survival probabilities males and females.
The Kaplan-Meier estimator can be calculated for different groups of individuals quite easily in r. We simply have to add the variable sex to our function in the model to do this.
ninja_km_gender <- survfit(Surv(obstacle_number,
censor)~ sex, data = ninja)A plot can be created to help us compare these groups.
Fun Fact: Jesse Labreck was the only woman to complete the 2021 American Ninja Warrior Stage 1 Finals Course.
You can watch her incredible run below
ggsurvplot(ninja_km_gender,
conf.int = TRUE)Checking the differences between the groups
One way of checking the differences between the groups is to use the survdiff function in the survival package. The survdiff function can be used to test the null hypothesis that the survival probabilities of the two groups are equal. The survdiff function requires a Surv object as input.
Note: The log-rank test is a non-parametric test. This means that it does not assume that the data is normally distributed.
The test that the survdiff function performs is a log-rank test. The formula for the test statistic of the log-rank test is:
\[X^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{V_j} \]
where
- \(O_j\) is the observed number of events in group \(j\)
- \(E_j\) is the expected number of events in group \(j\)
- \(V_j\) is the variance of \(O_j - E_j\)
- \(k\) is the number of groups
ninja_km_diff <- survdiff(Surv(obstacle_number,
censor) ~ sex, data = ninja)Other Log Rank Based Tests
Although the survdiff function uses the most common log-rank test, there are a variety of different methods for comparing survival between two groups.
These methods are all similar to a standard log-rank test but weight time points in order to detect differences better. Other common methods include:
- Gehan-Breslow method: A generalized Wilcoxon test
- The Tarone-Ware method: Detects early differences
- The Peto-Prentice method: Uses weighted time points based on the pooled survival estimates. More robust than TW or GB. Detects early differences.
- The Fleming-Harrington method, which is another weighted version of the log-rank test.
The surv_pvalue function in the survival package can be used to calculate the p-value for all of these tests by changing the method argument.
More Practice
Read in the anw_2023_stage1 data to complete the following problems.
The winner of the 2023 American Ninja Warrior competition was Vance Walker. Walker won $1 million for his victory.
Image Source: NBC
Fun Fact: Thread the Needle, the 8th obstacle on the 2021 Finals Stage 1 course, saw the highest failure rate of any obstacle. It had not previously been used on the Finals Stage 1 courses.
Image Source: Sasukepedia
Click here to read more about Edward Kaplan and Paul Meier’s work and how it has impacted the field of Statistics, particularly in regards to biostatistics.